{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "# Python 文件输入/输出\n",
    "\n",
    "**文件**不单单指磁盘上的普通文件，也指代任何抽象层面上的文件。例如：通过URL打开一个Web页面“文件”，Unix系统下进程间通讯也是通过抽象的进程“文件”进行的。由于使用了统一的接口，从而统一了各种抽象类型及非抽象类型文件的操作方式。\n",
    "\n",
    "所有程序都要处理输入和输出，需要在非易失性介质上做持久化存储和检索数据，所以文件操作的重要性无需多言。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "## 读写文本数据\n",
    "\n",
    "Python模仿Unix系统来进行文件操作，读写一个文件之前需要打开它：\n",
    "\n",
    "`file_obj = open(file_name, mode, buffering=-1, encoding=None, errors=None, ...)`\n",
    "\n",
    "* `file_obj`是`open()`返回的可迭代的**文件对象**，可以用`for`循环遍历；\n",
    "* `file_name`是要打开文件的文件名，非当前目录下的文件要指明路径；\n",
    "* `mode`指明文件类型和操作的字符串（文件读取模式）；\n",
    "* `buffering`用来设定缓冲模式，默认值为`-1`，即使用系统默认缓冲模式；\n",
    "* `encoding`用来指定文件编码、解码的编码类型，只针对文本文件有效，默认选择所在系统使用的编码方式；\n",
    "* `errors`用来指定编码错误的处理方式。\n",
    "\n",
    "`mode`第一个字母表明对文件的操作：\n",
    "\n",
    "* `r`表示读模式。\n",
    "* `w`表示写模式。如果文件不存在则新建文件，如果文件存在则重写新内容。\n",
    "* `x`表示在文件不存在时新建并写入文件。\n",
    "* `a`表示如果文件存在，在文件末尾追加写入内容。\n",
    "\n",
    "`mode`第二个字母是文件类型：\n",
    "\n",
    "* `t`（可省略）代表文本类型；\n",
    "* `b`表示二进制文件。\n",
    "\n",
    "一些`mode`参数的详细说明：\n",
    "\n",
    "```\n",
    "# 文件对象访问模式\n",
    "r   只读方式打开           rU   读方式打开，提供通用换行符支持\n",
    "w   以写方式打开           a    以追加模式打开\n",
    "r+  以读写模式打开         w+   以读写模式打开\n",
    "rb  以二进制读模式打开     wb   以二进制写模式打开\n",
    "ab  以二进制追加模式打开   rb+  以二进制读写模式打开\n",
    "wb+ 以二进制读写模式打开   ab+  以二进制读写模式打开\n",
    "\n",
    "# 以r模式打开时，文件必须存在，否则引发错误\n",
    "# 以w模式打开文件时，如果文件存在则清空，不存在则创建\n",
    "# 以a模式打开时，如果文件存在，则从EOF位置追加，否则创建新文件\n",
    "```\n",
    "\n",
    "`buffering`参数模式：\n",
    "\n",
    "```\n",
    "0        不缓冲\n",
    "1        只缓冲一行\n",
    "value>1  缓冲区大小为value\n",
    "value<0  使用系统默认缓冲机制\n",
    "```\n",
    "\n",
    "使用带有`rt`模式的`open()`函数读取文本文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# 将整个文件当作一个字符串读取\n",
    "with open('somefile.txt', 'rt') as f:\n",
    "    data = f.read()\n",
    "\n",
    "\n",
    "# 迭代读取文件的每一行\n",
    "with open('somefile.txt', 'rt') as f:\n",
    "    for line in f:\n",
    "        # process line\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "这里的`with`语句是Python中的上下文管理器（context manager）类型，形式为`with expression as variable`，其主要作用包括：保存、重置各种全局状态，锁住或解锁资源，关闭打开的文件等。\n",
    "\n",
    "你也可以不使用`with`语句，但这时候你必须记得手动关闭文件："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "f = open('somefile.txt', 'rt')\n",
    "data = f.read()\n",
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "上面的例子也可以用`try...finally...`实现，它们的效果是相同的（上下文管理器封装、简化了错误捕捉的过程）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "try:\n",
    "    f = open('somefile.txt', 'rt')\n",
    "    data = f.read()\n",
    "finally:\n",
    "    f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "除了文件对象之外，可以自己创建上下文管理器，只要定义了`__enter__()`和`__exit__()`方法就成为了上下文管理器类型。`with`语句的执行过程如下：\n",
    "\n",
    "1. 执行`with`后的语句获取上下文管理器，例如`open('somefile.txt', 'rt')`就是返回一个文件对象；\n",
    "2. 加载`__exit__()`方法备用；\n",
    "3. 执行`__enter__()`，该方法的返回值将传递给`as`后的变量（如果有的话）；\n",
    "4. 执行`with`语法块的子句；\n",
    "5. 执行`__exit__()`方法，如果`with`语法块子句中出现异常，将会传递`type, value, traceback` 给`__exit__()`，否则将默认为`None`；如果`__exit__()`方法返回`False`，将会抛出异常给外层处理；如果返回`True`，则忽略异常。\n",
    "\n",
    "关于上文管理器的更多内容可参考Python[官方文档](https://docs.python.org/3/library/stdtypes.html#context-manager-types)。\n",
    "\n",
    "为了写入一个文本文件，使用带有`wt`模式的`open()`函数，如果之前文件有内容则清除并覆盖掉。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# 写入文本数据块\n",
    "with open('somefile.txt', 'wt') as f:\n",
    "    f.write(text1)\n",
    "    f.write(text2)\n",
    "    pass\n",
    "\n",
    "# 重定向打印语句\n",
    "with open('somefile.txt', 'wt') as f:\n",
    "    print(line1, file=f)\n",
    "    print(line2, file=f)\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "如果是在已存在文件中添加内容，使用模式`at`。\n",
    "\n",
    "文件的读写操作默认使用系统编码，可以通过调用`sys.getdefaultencoding()`来得到。在大多数机器上面默认都是utf-8编码。如果你已经知道要读写的文本是其他编码方式，那么可通过传递一个可选的`encoding`参数给`open()`函数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "with open('somefile.txt', 'rt', encoding='gb18030') as f:\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "另外一个需要注意的问题是关于换行符的识别问题。在不同的系统平台上，换行符是不同的，例如在Unix下是`\\n`，而Windows中是`\\r\\n`。\n",
    "\n",
    "Python为了解决这个问题，使用了通用换行符支持（Universal NEWLINE Support）。默认情况下，Python会以统一模式处理换行符。在读取文本的时候，Python可以识别所有的普通换行符并将其转换为单个`\\n`字符。类似的，在输出时会将换行符`\\n`转换为系统默认的换行符。通过这种方式，Python屏蔽了不同平台下的换行符差异。\n",
    "\n",
    "如果你不希望这种默认的处理方式，可以给`open()`函数传入参数`newline=''`："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# 读取时停用换行符转换\n",
    "with open('somefile.txt', 'rt', newline='') as f:\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "在Unix机器上面读取一个Windows上面的文本文件，里面的内容是`hello, world!\\r\\n'："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# 开启换行符转换（默认）\n",
    "f = open('hello.txt', 'rt')\n",
    "f.read()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# 停用换行符转换\n",
    "g = open('hello.txt', 'rt', newline='')\n",
    "g.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "这里需要提醒大家，跨平台开发会遇到一些不可避免的问题，不同平台下的**换行符差异**及**路径分隔符差异**等等就是一个特例。Python的`os`模块提供了一些属性值以便于跨平台应用的开发：\n",
    "\n",
    "- `os.linesep`      （当前系统下，下同）用于在文件中分隔行的字符串\n",
    "\n",
    "- `os.sep`           用于分隔文件路径名的字符串\n",
    "\n",
    "- `os.pathsep`       用于分隔文件路径的字符串\n",
    "\n",
    "- `os.curdir`        当前工作目录的字符串名称\n",
    "\n",
    "- `os.pardir`        当前工作目录父目录的字符串名称\n",
    "\n",
    "最后一个问题就是文本文件中可能出现的编码错误。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# file rain.txt\n",
    "# 下雨天留客天留我不留\n",
    "f = open('rain.txt', 'rt', encoding='ascii')\n",
    "f.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "如果出现这个错误，通常表示你读取文本时指定的编码不正确，你需要确认和指定正确的文件编码。如果编码错误还是存在的话，你可以给`open()`函数传递一个可选的`errors`参数来处理这些错误。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# 使用Unicode U+fffd 替换错误字符\n",
    "f = open('rain.txt', 'rt', encoding='ascii', errors='replace')\n",
    "f.read()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# 忽略错误字符\n",
    "g = open('rain.txt', 'rt', encoding='ascii', errors='ignore')\n",
    "g.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "对于文本处理的首要原则是确保你的总是使用正确的文件编码。当模棱两可时，就使用默认的设置（通常都是UTF-8）。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "## 打印输出至文件中\n",
    "一般我们使用文件对象的`write()`方法写入文本文件。它没有增加空格或者换行符。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "f = open('allo.txt', 'wt')\n",
    "print(f.write('Hello, world!'))  # 返回写入文件的字节数\n",
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "另外，我们可以指定`print()`函数的`file`关键字参数，将其输出重定向到一个文件中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "with open('allo.txt', 'wt') as f:\n",
    "    print('Hello, world!', file=f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "这里需要注意的是文件必须以文本模式打开。`print()`默认会在每个参数后面添加空格，在每行结束处添加换行。\n",
    "\n",
    "可以在`print()`函数中使用`sep`和`end`关键字参数，以你想要的方式输出："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "print(2017, 4, 11)\n",
    "print(2017, 4, 11, sep='/')\n",
    "print(2017, 4, 11, sep='/', end=' <--\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "## 读写二进制数据\n",
    "\n",
    "使用模式为`rb`或`wb`的`open()`函数来读取或写入二进制数据。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# 将整个文件当作一个字节字符串读取\n",
    "with open('somefile.bin', 'rb') as f:\n",
    "    data = f.read()\n",
    "\n",
    "# 将二进制数据写入文件\n",
    "with open('somefile.bin', 'wb') as f:\n",
    "    f.write(b'Hello World')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "在读取二进制数据时，所有返回的数据都是字节字符串格式的，而不是文本字符串。在写入的时候，必须保证参数是以字节形式对外暴露数据的对象（比如字节字符串，字节数组对象等）。\n",
    "\n",
    "另外，在读取二进制数据的时候，索引和迭代动作返回的是字节的值而不是字节字符串。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# 文本字符串\n",
    "t = 'Hello World'\n",
    "print(t[0])\n",
    "\n",
    "for c in t:\n",
    "    print(c)\n",
    "\n",
    "\n",
    "# 字节字符串\n",
    "b = b'Hello World'\n",
    "print(b[0])\n",
    "for c in b:\n",
    "    print(c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "如果你想从二进制模式的文件中读取或写入文本数据，必须确保要进行解码和编码操作。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "with open('somefile.bin', 'rb') as f:\n",
    "    data = f.read(16)\n",
    "    text = data.decode('utf-8')\n",
    "\n",
    "with open('somefile.bin', 'wb') as f:\n",
    "    text = 'Hello World'\n",
    "    f.write(text.encode('utf-8'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "## 文件不存在才能写入\n",
    "\n",
    "当文件不存在时才能写入，不允许覆盖已存在的文件内容。\n",
    "\n",
    "在`open()`函数中使用`x`模式来代替`w`模式来进行处理："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "with open('somefile', 'wt') as f:\n",
    "    f.write('Hello\\n')\n",
    "\n",
    "with open('somefile', 'xt') as f:\n",
    "    f.write('Hello\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "如果文件是二进制的，使用`xb`来代替`xt`。\n",
    "\n",
    "一个替代方案是先测试这个文件是否存在："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "if not os.path.exists('somefile'):\n",
    "    with open('somefile', 'wt') as f:\n",
    "        f.write('Hello\\n')\n",
    "else:\n",
    "    print('File already exists!')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "## 使用`read()`、`readline()`和`readlines()`读文本文件\n",
    "\n",
    "使用不带参数的`read()`函数一次读入文件的所有内容。可以设置最大的读入字符数限制`read()`函数一次返回的大小，如果没有给定size或者size为负数，文件将被读取直至末尾："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "chunk = 5\n",
    "content = ''\n",
    "fobj = open('allo.txt', 'rt')\n",
    "while 1:\n",
    "    fragment = fobj.read(chunk)\n",
    "    if not fragment:\n",
    "        break\n",
    "    content += fragment\n",
    "\n",
    "fobj.close()\n",
    "len(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "读到文件结尾之后，再次调用read()会返回空字符串（''）。这个函数不推荐使用。\n",
    "\n",
    "使用`readline()`每次读入文件的一行，如果指定了`size`，每次读取文件中`size`个字节，返回一个字符串。如果没有给定`size`或者`size`为负数则返回一行（包括行结束符）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "content = ''\n",
    "fobj = open('argv.py', 'rt')\n",
    "while 1:\n",
    "    line = fobj.readline()\n",
    "    if not line:\n",
    "        break\n",
    "    content += line\n",
    "\n",
    "fobj.close()\n",
    "len(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "对于文本文件，空行也有1字符长度（换行符`\\n`），当文件读取结束后，`readline()`会返回空字符串。\n",
    "\n",
    "使用迭代器读取文件，每次返回一行。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "content = ''\n",
    "fobj = open('argv.py', 'rt')\n",
    "for line in fobj:\n",
    "    content += line\n",
    "\n",
    "fobj.close()\n",
    "len(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "采用迭代方式读取超大型文件或者网络流文件的好处是显而易见的，避免了一次性将大型文件读入内存所带来的负担（有时候甚至是不可能的，例如网络流文件）。对于小型文件还是一次性读入好些，可以尽快释放文件资源。\n",
    "\n",
    "函数`readlines()`读取剩余的所有的行（文件指针不一定在开始位置！）并将其以一个字符串列表形式返回。此方法一次性读取文件所有的内容至内存中，适用于小型文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "fobj = open('argv.py', 'rt')\n",
    "lines = fobj.readlines()\n",
    "fobj.close()\n",
    "print(len(lines))\n",
    "for line in lines:\n",
    "    print(line, end='')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "与`readlines()`相反，可以使用`file_obj.writelines()`来写入文件，它接收一个字符串列表作为参数并将其写入到文件中，每个字符串的行结束符不会被自动写入，通常适用于每个字符串末尾都包含换行符的字符串列表。\n",
    "\n",
    "## 跟踪文件读写位置（移动文件指针）\n",
    "\n",
    "* `tell()`返回距离文件开始处的字节偏移量，类似于C语言中的`ftell`。\n",
    "\n",
    "* `seek(offset[, whence])`跳转到文件其他字节偏移量的位置，同样返回当前的偏移量。其类似于C语言中的`fseek`，`offset`为偏移量，`whence`代表相对位置，是一个可选参数。\n",
    "\n",
    "  * 如果`whence`等于0（默认为0），从开头偏移`offset`个字节；\n",
    "  * 如果`whence`等于1，从当前位置处偏移`offset`个字节；\n",
    "  * 如果`whence`等于2，距离最后结尾处偏移`offset`个字节。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "bdata = bytes(range(0, 256))\n",
    "fobj = open('bfile', 'wb')\n",
    "fobj.write(bdata)\n",
    "fobj.close\n",
    "\n",
    "fobj = open('bfile', 'rb')\n",
    "print(fobj.tell())\n",
    "print(fobj.seek(255))\n",
    "print(fobj.seek(-5, 1))  # # 从当前位置后退5个字节\n",
    "bdata = fobj.read()\n",
    "print(len(bdata))\n",
    "print(bdata)\n",
    "print(bdata[0])\n",
    "fobj.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "## 一些其它方法\n",
    "\n",
    "- `file_obj.fileno()`  返回打开文件的文件描述符，这是一个整形，可以用于os模块的一些底层操作\n",
    "\n",
    "- `file_obj.flush()`   把输出缓冲区内的数据立即写入文件，调用`close`时会自动调用这个方法。\n",
    "\n",
    "- `file_obj.isatty()`  当文件是一个tty设备时，返回`True`。tty是字符型终端设备，例如老式的打印机以及操作系统中的终端（Terminal）程序。\n",
    "\n",
    "- `file_obj.next()`    返回文件的下一行，类似于`readline`方法，没有其它行时引发`StopIteration`异常。\n",
    "\n",
    "# 数据编码和处理\n",
    "\n",
    "## 读写CSV数据\n",
    "\n",
    "使用Python自带的`csv`库来读写CSV格式的文件。假设有一个保存股票市场数据的文件`stocks.csv`：\n",
    "\n",
    "```\n",
    "Symbol,Price,Date,Time,Change,Volume\n",
    "\"AA\",39.48,\"6/11/2007\",\"9:36am\",-0.18,181800\n",
    "\"AIG\",71.38,\"6/11/2007\",\"9:36am\",-0.15,195500\n",
    "\"AXP\",62.58,\"6/11/2007\",\"9:36am\",-0.46,935000\n",
    "\"BA\",98.31,\"6/11/2007\",\"9:36am\",+0.12,104800\n",
    "\"C\",53.08,\"6/11/2007\",\"9:36am\",-0.25,360900\n",
    "\"CAT\",78.29,\"6/11/2007\",\"9:36am\",-0.23,225400\n",
    "```\n",
    "\n",
    "可以通过下面的代码将这些数据读取为一个元组的序列："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "with open('stocks.csv') as f:\n",
    "    f_csv = csv.reader(f)\n",
    "    headers = next(f_csv)\n",
    "    for row in f_csv:\n",
    "        # process row\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "上面的代码中，`row`会是一个列表，为了访问某个字段，可以使用下标，如`row[0]`访问`Symbol`，`row[4]`访问`Change`。\n",
    "\n",
    "这里可以使用命名元组避免引起混淆："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "from collections import namedtuple\n",
    "\n",
    "with open('stocks.csv') as f:\n",
    "    f_csv = csv.reader(f)\n",
    "    headings = next(f_csv)\n",
    "    Row = namedtuple('Row', headings)\n",
    "    for r in f_csv:\n",
    "        row = Row(*r)\n",
    "        # process row\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "现在，你可以使用列名如`row.Rymbol`和`row.Change`代替下标访问。需要注意的是这个只有在列名是合法的Python标识符的时候才生效。如果不是的话，你可能需要修改原始的列名(如将非标识符字符替换成下划线之类的)。例如，一个CSV格式文件有一个包含非法标识符的头行，类似下面这样：\n",
    "\n",
    "```\n",
    "Street Address,Num-Premises,Latitude,Longitude 5412 N CLARK,10,41.980262,-87.668452\n",
    "```\n",
    "\n",
    "可以使用正则表达式替换非法标识符："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "with open('stocks.csv') as f:\n",
    "    f_csv = csv.reader(f)\n",
    "    headers = [re.sub('[^a-zA-Z_]', '_', h) for h in next(f_csv)]\n",
    "    Row = namedtuple('Row', headers)\n",
    "    for r in f_csv:\n",
    "        row = Row(*r)\n",
    "        # process row\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "另外，可以选择将数据读取到一个字典中去："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "with open('stocks.csv') as f:\n",
    "    f_csv = csv.DictReader(f)\n",
    "    for row in f_csv:\n",
    "        # process row\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "现在，你可以使用列名去访问每一行的数据。比如`row['Symbol']`和`row['Change']`。\n",
    "\n",
    "为了写入CSV数据，需要创建一个`csv.writer`对象。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "headers = ['Symbol','Price','Date','Time','Change','Volume']\n",
    "rows = [\n",
    "    ('AA', 39.48, '6/11/2007', '9:36am', -0.18, 181800),\n",
    "    ('AIG', 71.38, '6/11/2007', '9:36am', -0.15, 195500),\n",
    "    ('AXP', 62.58, '6/11/2007', '9:36am', -0.46, 935000),\n",
    "]\n",
    "\n",
    "with open('stocks.csv', 'w') as f:\n",
    "    f_csv = csv.writer(f)\n",
    "    f_csv.writerow(headers)\n",
    "    f_csv.writerows(rows)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "如果是字典数据的话，可以这样："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "headers = ['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']\n",
    "rows = [\n",
    "    {'Symbol':'AA', 'Price':39.48, 'Date':'6/11/2007',\n",
    "     'Time':'9:36am', 'Change':-0.18, 'Volume':181800},\n",
    "    {'Symbol':'AIG', 'Price': 71.38, 'Date':'6/11/2007',\n",
    "     'Time':'9:36am', 'Change':-0.15, 'Volume': 195500},\n",
    "    {'Symbol':'AXP', 'Price': 62.58, 'Date':'6/11/2007',\n",
    "     'Time':'9:36am', 'Change':-0.46, 'Volume': 935000},\n",
    "]\n",
    "\n",
    "with open('stocks.csv', 'w') as f:\n",
    "    f_csv = csv.DictWriter(f, headers)\n",
    "    f_csv.writeheader()\n",
    "    f_csv.writerow(rows)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "csv产生的数据都是字符串类型的，它不会做任何其他类型的转换，如果你需要做这样的转换，你必须自己手动去实现。\n",
    "\n",
    "下面是一个转换字典中特定字段的例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "print('Reading as dicts with type conversion')\n",
    "field_types = [\n",
    "    ('Price', float),\n",
    "    ('Change', float),\n",
    "    ('Volume', int)\n",
    "]\n",
    "\n",
    "with open('stocks.csv') as f:\n",
    "    for row in csv.DictReader(f):\n",
    "        row.update((key, conversion(row[key]))\n",
    "                   for key, conversion in field_types)\n",
    "        print(row)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "## 读写JSON数据\n",
    "\n",
    "Python的`json`模块提供了非常简单的方式来编码和解码JSON(JavaScript Object Notation)数据。其中两个主要的函数是`json.dumps()`和`json.loads()`。\n",
    "\n",
    "下面的代码将Python数据结构转换为JSON："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "data =  {\n",
    "    'name' : 'ACME',\n",
    "    'shares' : 100,\n",
    "    'price' : 542.23\n",
    "}\n",
    "\n",
    "json_str = json.dumps(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "下面的代码将一个JSON编码的字符串转换回一个Python数据结构："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "data = json.loads(json_str)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "如果要处理的是文件而不是字符串，可以使用`json.dump()`和`json.load()`来编码和解码JSON数据。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# Writing JSON data\n",
    "with open('data.json', 'w') as f:\n",
    "    json.dump(data, f)\n",
    "\n",
    "# Reading data back\n",
    "with open('data.json', 'r') as f:\n",
    "    data = json.load(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "JSON编码支持的基本数据类型为`None`，`bool`，`int`，`float`和`str`，以及包含这些类型数据的**lists**，**tuples**和**dictionaries**。对于dictionaries，keys需要是字符串类型(字典中任何非字符串类型的key在编码时会先转换为字符串)。为了遵循JSON规范，你应该只编码Python的lists和dictionaries。\n",
    "\n",
    "JSON编码的格式对于Python语法而已几乎是完全一样的，除了一些小的差异之外。比如，`True`会被映射为`true`，`False`被映射为`false`，而`None`会被映射为`null`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "print(json.dumps(False))\n",
    "\n",
    "d = {'a': True,\n",
    "     'b': 'Hello',\n",
    "     'c': None}\n",
    "\n",
    "print(json.dumps(d))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "## 解析简单的XML数据\n",
    "\n",
    "可以使用`xml.etree.ElementTree`模块从简单的XML文档中提取数据。下面的例子展示了如何解析Planet Python上的RSS源。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "from urllib.request import urlopen\n",
    "from xml.etree.ElementTree import parse\n",
    "\n",
    "# Download the RSS feed and parse it\n",
    "u = urlopen('http://planet.python.org/rss20.xml')\n",
    "doc = parse(u)\n",
    "\n",
    "# Extract and output tags of interest\n",
    "for item in doc.iterfind('channel/item'):\n",
    "    title = item.findtext('title')\n",
    "    date = item.findtext('pubDate')\n",
    "    link = item.findtext('link')\n",
    "\n",
    "    print(title)\n",
    "    print(date)\n",
    "    print(link)\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "`xml.etree.ElementTree.parse()`函数解析整个XML文档并将其转换成一个文档对象。然后，你就能使用`find()`、`iterfind()`和`findtext()`等方法来搜索特定的XML元素了。这些函数的参数就是某个指定的标签名，例如`channel/item`或`title`。\n",
    "\n",
    "每次指定某个标签时，你需要遍历整个文档结构。每次搜索操作会从一个起始元素开始进行。同样，每次操作所指定的标签名也是起始元素的相对路径。例如，执行`doc.iterfind('channel/item')`来搜索所有在`channel`元素下面的`item`元素。`doc`代表文档的最顶层(也就是第一级的`rss`元素)。然后接下来的调用`item.findtext()`会从已找到的`item`元素位置开始搜索。\n",
    "\n",
    "`ElementTree`模块中的每个元素有一些重要的属性和方法，在解析的时候非常有用。`tag`属性包含了标签的名字，`text`属性包含了内部的文本，而`get()`方法能获取属性值。例如："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "print(doc)\n",
    "\n",
    "e = doc.find('channel/title')\n",
    "print(e)\n",
    "\n",
    "print(e.text)\n",
    "\n",
    "print(e.get('some_attribute'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "对于更高级的应用程序，你需要考虑使用[lxml](http://lxml.de)。它使用了和`ElementTree`同样的编程接口，因此上面的例子同样也适用于lxml。你只需要将刚开始的`import`语句换成`from lxml.etree import parse`就行了。`lxml`完全遵循XML标准，并且速度也非常快，同时还支持验证，XSLT和XPath等特性。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "## 序列化数据\n",
    "\n",
    "序列化（serializing）即存储数据结构到一个文件中。Python提供了`pickle`模块以特殊的二进制格式保存和恢复数据对象。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.hycell": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "import pickle\n",
    "import datetime\n",
    "\n",
    "now1 = datetime.datetime.utcnow()\n",
    "pickled = pickle.dumps(now1)\n",
    "now2 = pickle.loads(pickled)\n",
    "print(now1)\n",
    "print(now2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "使用`pickle.dump()`序列化数据到文件，而函数`pickle.load()`用作反序列化。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  },
  "name": "11_file_io_and_structured_text_files.ipynb",
  "toc": {
   "colors": {
    "hover_highlight": "#ddd",
    "running_highlight": "#FF0000",
    "selected_highlight": "#ccc"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "260px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": false,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
